Generalised Brown Clustering and Roll-Up Feature Generation
Abstract
Brown clustering is an established technique, used in hundreds of computational linguistics papers each year, to group word types that have similar distributional information. It is unsupervised and can be used to create powerful word representations for machine learning. Despite its improbable success relative to more complex methods, few have investigated whether Brown clustering has really been applied optimally. In this paper, we present a subtle but profound generalisation of Brown clustering that improves overall clustering quality by decoupling the number of output classes from the size of the computational active set. Moreover, the generalisation permits a novel approach to feature selection from Brown clusters: we show that the standard approach of shearing the Brown clustering output tree at arbitrary bitlengths is lossy, and that features should instead be chosen by rolling up Generalised Brown hierarchies. The generalisation and the corresponding feature generation are more principled, challenging the way Brown clustering is currently understood and applied.
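To make the "shearing at arbitrary bitlengths" practice concrete, the following is a minimal sketch of how prefix features are commonly derived from Brown cluster bit-string paths. It illustrates only the standard approach that the abstract criticises, not the roll-up method, whose details are not given here. The function name `shear_features`, the feature naming scheme, the toy paths, and the chosen bitlengths are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def shear_features(path: str, bitlengths: Iterable[int]) -> List[str]:
    """Cut a Brown cluster bit-string path at each bitlength.

    Each prefix identifies an ancestor node in the binary merge tree,
    so a shorter prefix corresponds to a coarser cluster.
    """
    return [f"brown_{n}={path[:n]}" for n in bitlengths]


# Hypothetical toy paths (not taken from any real clustering run).
toy_paths: Dict[str, str] = {
    "monday": "0010110",
    "tuesday": "0010111",
    "london": "110100",
}

# Illustrative bitlengths; in practice these are picked by hand.
features = {word: shear_features(path, (2, 4)) for word, path in toy_paths.items()}

for word, feats in features.items():
    print(word, feats)
```

In this toy example, "monday" and "tuesday" receive identical features at both bitlengths because their paths differ only in the final bit, which is one simple way to see why cutting at fixed depths can discard distinctions present in the full hierarchy.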
Similar resources
Improving Quality of Hierarchical Clustering for Large Data Series
Brown clustering is a hard, hierarchical, bottom-up clustering of words in a vocabulary. Words are assigned to clusters based on their usage pattern in a given corpus. The resulting clusters and hierarchical structure can be used in constructing class-based language models and for generating features to be used in natural language processing (NLP) tasks. Because of its high computational cost, ...
Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines
In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...
An automatic rule base generation method for fuzzy pattern recognition with multiphased clustering
This paper presents a new approach for the automatic generation of fuzzy rule bases for pattern recognition. The general idea of the approach is to use and enhance the fuzzy c-means clustering algorithm. The rule base is generated through an iterative feature clustering approach. The automatic extraction of features is repeated until the generated rule base gives an unequivocal answer. Alth...
Computational evaluation of the homogeneity of composites processed by accumulative roll bonding (ARB)
A new computational method based on MATLAB was used to study the effect of different parameters on the homogeneity of composites produced by a severe plastic deformation technique known as accumulative roll bonding. For a higher number of passes, the degree of particle agglomeration and clustering decreased, and an appreciable homogeneity was obtained in both longitudinal and transverse directi...